2 research outputs found

FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

Authors: Aananou Soukaïna, Bjelogrlic Mina, Gaudet-Blavignac Christophe, Goldman Jean-Philippe, Lovis Christian, Zaghir Jamil
Publication date: 19/09/2023

Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in large language models (LLMs), since there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it is an efficient way to enlarge low-resource corpora with little human effort, using only open data resources that are already available. Quantitative and qualitative evaluations are often lacking for semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French NLP applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
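To make the idea of annotation projection concrete, here is a minimal sketch of how an entity span on a source sentence could be carried over to its translation once a token alignment is known. It is not the authors' implementation: the abstract indicates the alignment comes from a language-agnostic BERT-based approach, whereas here it is simply hard-coded. The function `project_span`, the token lists, and the alignment pairs are all hypothetical and made up for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def project_span(span: Span, alignment: List[Tuple[int, int]]) -> Optional[Span]:
    """Project a source-side token span onto the target sentence.

    `alignment` is a list of (source_index, target_index) pairs. In a real
    pipeline it would be produced by a word aligner; here it is assumed given.
    """
    start, end = span
    # Collect every target token aligned to a token inside the source span.
    targets = [tgt for src, tgt in alignment if start <= src < end]
    if not targets:
        return None  # nothing aligned: the annotation cannot be projected
    # Use the smallest contiguous window that covers all aligned tokens.
    return (min(targets), max(targets) + 1)


# Made-up toy example (not taken from the corpus).
source_tokens = ["carcinoma", "ductal", "infiltrante", "de", "mama"]
target_tokens = ["carcinome", "canalaire", "infiltrant", "du", "sein"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # assumed aligner output

entity_span = (0, 3)  # source annotation covering the first three tokens
projected = project_span(entity_span, alignment)
projected_text = (
    " ".join(target_tokens[projected[0]:projected[1]]) if projected else None
)
print(projected, projected_text)
```

Taking the minimal covering window is only one possible design choice; it keeps projected entities contiguous but can over-extend a span when the aligner links a source token to a distant target token.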

arXiv.org e-Print Archive

FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

Authors: Aananou Soukaïna, Bjelogrlic Mina, Gaudet-Blavignac Christophe, Goldman Jean-Philippe, Lovis Christian, Zaghir Jamil
Publication date: 18/09/2023

The French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED) contains 2'051 synthetic clinical cases in French, with 24'037 annotated entities. The dataset contains two subsets:

CANTEMIST-FR: Originally from CANTEMIST (Miranda-Escalada et al., 2020), it contains 1'301 oncological notes, with 15'978 annotations linked to an ICD-O-3.1 morphology code. Additionally, 15'457 of them are linked to a SNOMED-CT code.

DISTEMIST-FR: Originally from DISTEMIST's training set (Miranda-Escalada et al., 2022), it contains 750 clinical cases, with 8'059 annotations, 5'132 of which are linked to a SNOMED-CT code.

Please cite us: Zaghir, J., Bjelogrlic, M., Goldman, J.-P., Aananou, S., Gaudet-Blavignac, C., & Lovis, C. (2023). FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection. arXiv preprint http://arxiv.org/abs/2309.1077
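The subset figures quoted in the dataset description above can be cross-checked against the stated totals with a few lines of arithmetic. The sketch below only restates numbers from the description; the dictionary layout and variable names are hypothetical and do not reflect the actual file structure of the corpus.

```python
# Figures copied from the dataset description; the structure is illustrative.
subsets = {
    "CANTEMIST-FR": {"documents": 1301, "annotations": 15978, "snomed_linked": 15457},
    "DISTEMIST-FR": {"documents": 750, "annotations": 8059, "snomed_linked": 5132},
}

total_documents = sum(s["documents"] for s in subsets.values())
total_annotations = sum(s["annotations"] for s in subsets.values())
total_snomed_linked = sum(s["snomed_linked"] for s in subsets.values())

# Both sums should match the corpus-level totals given in the description.
assert total_documents == 2051
assert total_annotations == 24037

# Share of all annotations that carry a SNOMED-CT code.
snomed_share = total_snomed_linked / total_annotations
print(total_documents, total_annotations, total_snomed_linked, round(snomed_share, 3))
```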

ZENODO